IBM - Exploratory Data Analysis for Machine Learning Assignment

  • Brief description of the data set and a summary of its attributes

    • The dataset contains 1470 observations, 34 features, and a target variable named 'Attrition'. The dataset was clean and contained no NULL values. A data dictionary was provided to give context for individual labels: Education is on a 1-5 scale, while Environment_Satisfaction, Job_Involvement, Job_Satisfaction, Performance_Rating, Relationship_Satisfaction, and Work_Life_Balance are on 1-4 scales; Job_Level is on a 1-5 scale and Stock_Option_Level on a 0-3 scale; Distance_From_Home is measured in kilometers and Percent_Salary_Hike is the percentage increase in salary compared to the previous year.
  • Initial plan for data exploration

    • Check for missing data (NULLS).
    • Review columns to identify what is needed or not.
    • Review column names and understand what each label means.
    • Make plots for initial insights.
  • Actions taken for data cleaning and feature engineering

    • Clean column names.
    • Review target variable against other features ('Attrition' vs x) and create plots against 'Attrition'.
    • Remove columns: 'EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber'
  • Key Findings and Insights, which synthesizes the results of Exploratory Data Analysis in an insightful and actionable manner

    • 'Attrition' was 66.67% for 19-year-olds; the rate then drops drastically with age until employees reach 58, where it jumps back up to 35.71%.
    • When 'Distance_From_Home' was 25 kilometers, the 'Attrition' rate was 42.85%.
    • When 'Percent_Salary_Hike' was 24 (a 24% increase over the previous year), the 'Attrition' rate was 28.57%.
    • When 'Total_Working_Years' was 0, the 'Attrition' rate was 45.45%; at 1 it was 49.38%, and at 40 it was 100%, likely due to retirement.
    • When 'Training_Times_Last_Year' was 0, the 'Attrition' rate was 27.78%.
    • When 'Years_At_Company' was 0, the 'Attrition' rate was 36.36%; at 1 it was 34.50%.
    • Interestingly, employees working 'Overtime' had an 'Attrition' rate of 30.52%, compared to 10.43% for those who did not.
    • Employees who 'Travel_Frequently' had an 'Attrition' rate of 24.90%.
    • Unsurprisingly, the 'Job_Role' with the highest 'Attrition' rate was Sales Representative, at 39.75%.
    • Employees with a 'Marital_Status' of Single had an 'Attrition' rate of 25.53%.
    • An 'Education_Field' of Human Resources had an 'Attrition' rate of 25.92%.
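
    The group-level rates above can all be produced with one small helper: the mean of a 0/1 target within each category is that category's attrition rate. A minimal sketch, using a tiny illustrative frame rather than the real data:

    ```python
    import pandas as pd

    def attrition_rate_by(df, col):
        # Mean of a 0/1 target within each group = attrition rate per category.
        return (df.groupby(col)['Attrition'].mean() * 100).round(2)

    # Illustrative toy frame, not the IBM data.
    toy = pd.DataFrame({
        'Overtime':  ['Yes', 'Yes', 'No', 'No', 'No'],
        'Attrition': [1, 0, 0, 0, 1],
    })
    print(attrition_rate_by(toy, 'Overtime'))  # Yes -> 50.0, No -> 33.33
    ```

    Applied to the cleaned frame, the same helper reproduces each bullet above (e.g. `attrition_rate_by(clean_df, 'Overtime')`).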
  • Formulating at least 3 hypotheses about this data

      - Ho: µ Education_Attrition == µ Education_Not_Attrition
      - Ha: µ Education_Attrition != µ Education_Not_Attrition
    
      - Ho: µ Age_Attrition == µ Age_Not_Attrition
      - Ha: µ Age_Attrition != µ Age_Not_Attrition
    
      - Ho: µ Job_Satisfaction_Attrition == µ Job_Satisfaction_Not_Attrition
      - Ha: µ Job_Satisfaction_Attrition != µ Job_Satisfaction_Not_Attrition
  • Conducting a formal significance test for one of the hypotheses and discuss the results

    • ss.kruskal(attrition_df['Education'], not_attrition_df['Education']) returned
      KruskalResult(statistic=1.3527640913093548, pvalue=0.2447954753326153)

      pvalue > 0.05

      There appears to be no statistically significant difference in Education between employees who left and those who stayed, thus we fail to reject the null hypothesis.
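
      The same test can be run for the other two hypotheses. A reproducible sketch, using synthetic ordinal samples in place of the real `attrition_df`/`not_attrition_df` subsets (the H statistic and p-value below are therefore illustrative, not the assignment's results):

      ```python
      import numpy as np
      import scipy.stats as ss

      rng = np.random.default_rng(0)
      # Synthetic stand-ins for the two groups' 'Education' values (scale 1-5).
      attrition_edu     = rng.integers(1, 6, size=200)
      not_attrition_edu = rng.integers(1, 6, size=800)

      # Kruskal-Wallis H-test: nonparametric, appropriate for ordinal data.
      stat, p = ss.kruskal(attrition_edu, not_attrition_edu)
      alpha = 0.05
      decision = 'reject H0' if p < alpha else 'fail to reject H0'
      print(f'H={stat:.4f}, p={p:.4f} -> {decision}')
      ```

      Swapping in `['Age']` or `['Job_Satisfaction']` from the two subsets tests the remaining hypotheses the same way.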

  • Suggestions for next steps in analyzing this data

    • Continue testing the remaining hypotheses to home in on the true indicators of Attrition.
    • Build a model to uncover which features contribute most to Attrition.
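
    One way the modeling step might look: fit a logistic regression on standardized numeric features and rank them by coefficient magnitude. This is a minimal sketch on a synthetic frame (the real next step would use `clean_df`; the column names and target rule here are purely illustrative):

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    # Toy stand-in for the cleaned frame's numeric features.
    rng = np.random.default_rng(42)
    X = pd.DataFrame({
        'Age':                 rng.integers(18, 60, 300),
        'Monthly_Income':      rng.integers(1000, 20000, 300),
        'Total_Working_Years': rng.integers(0, 40, 300),
    })
    y = (X['Age'] < 30).astype(int)  # synthetic target for illustration only

    # Standardizing first makes coefficient magnitudes comparable across features.
    Xs = StandardScaler().fit_transform(X)
    model = LogisticRegression().fit(Xs, y)

    coefs = pd.Series(model.coef_[0], index=X.columns).sort_values(key=abs, ascending=False)
    print(coefs)
    ```

    On the real data, categorical columns such as 'Overtime' and 'Job_Role' would need encoding first, and tree-based feature importances would be a natural cross-check.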
  • A paragraph that summarizes the quality of this data set and a request for additional data if needed

    • Once again, this dataset was immaculately clean, and a data dictionary was provided to give context for each individual column. Additional data that would provide more insight is the reason for attrition: did the individual pursue further education, did they take a job in a completely different industry, or did they make a lateral move?

Data Dictionary

Education: 1 'Below College', 2 'College', 3 'Bachelor', 4 'Master', 5 'Doctor'

Environment_Satisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'

Job_Involvement: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'

Job_Satisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'

Performance_Rating: 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding'

Relationship_Satisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'

Work_Life_Balance: 1 'Bad', 2 'Good', 3 'Better', 4 'Best'

Distance_From_Home: Measured in Kilometers

Stock_Option_Level: 0 - 3 scale

Job_Level: 1 - 5 scale

Percent_Salary_Hike: Percentage increase compared to the previous year
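
The ordinal codes above can be decoded into their labels for more readable plots and tables. A minimal sketch, with the mapping dict built from the data dictionary (shown for Education; the other scales work the same way):

```python
import pandas as pd

# Mapping taken from the data dictionary above.
education_labels = {1: 'Below College', 2: 'College', 3: 'Bachelor',
                    4: 'Master', 5: 'Doctor'}

s = pd.Series([3, 1, 5, 4])
print(s.map(education_labels).tolist())
# -> ['Bachelor', 'Below College', 'Doctor', 'Master']
```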

Importing Libraries & Dataset

In [3]:
import pandas as pd
pd.set_option('display.max_columns', None)
from pandas_profiling import ProfileReport

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.figure_factory as ff

import scipy.stats as ss
import sklearn as skl
from scipy import stats
In [5]:
df = pd.read_csv('IBM_HR.csv')
print(df.shape)
df.head()
(1470, 35)
Out[5]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 2 Female 94 3 2 Sales Executive 4 Single 5993 19479 8 Y Yes 11 3 1 80 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 3 Male 61 2 2 Research Scientist 2 Married 5130 24907 1 Y No 23 4 4 80 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 4 4 Male 92 2 1 Laboratory Technician 3 Single 2090 2396 6 Y Yes 15 3 2 80 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 5 4 Female 56 3 1 Research Scientist 3 Married 2909 23159 1 Y Yes 11 3 3 80 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 7 1 Male 40 3 1 Laboratory Technician 2 Married 3468 16632 9 Y No 12 3 4 80 1 6 3 3 2 2 2 2

EDA

In [8]:
df.isnull().any()
Out[8]:
Age                         False
Attrition                   False
BusinessTravel              False
DailyRate                   False
Department                  False
DistanceFromHome            False
Education                   False
EducationField              False
EmployeeCount               False
EmployeeNumber              False
EnvironmentSatisfaction     False
Gender                      False
HourlyRate                  False
JobInvolvement              False
JobLevel                    False
JobRole                     False
JobSatisfaction             False
MaritalStatus               False
MonthlyIncome               False
MonthlyRate                 False
NumCompaniesWorked          False
Over18                      False
OverTime                    False
PercentSalaryHike           False
PerformanceRating           False
RelationshipSatisfaction    False
StandardHours               False
StockOptionLevel            False
TotalWorkingYears           False
TrainingTimesLastYear       False
WorkLifeBalance             False
YearsAtCompany              False
YearsInCurrentRole          False
YearsSinceLastPromotion     False
YearsWithCurrManager        False
dtype: bool
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                  1470 non-null   int64 
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64 
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64 
 19  MonthlyRate               1470 non-null   int64 
 20  NumCompaniesWorked        1470 non-null   int64 
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64 
 24  PerformanceRating         1470 non-null   int64 
 25  RelationshipSatisfaction  1470 non-null   int64 
 26  StandardHours             1470 non-null   int64 
 27  StockOptionLevel          1470 non-null   int64 
 28  TotalWorkingYears         1470 non-null   int64 
 29  TrainingTimesLastYear     1470 non-null   int64 
 30  WorkLifeBalance           1470 non-null   int64 
 31  YearsAtCompany            1470 non-null   int64 
 32  YearsInCurrentRole        1470 non-null   int64 
 33  YearsSinceLastPromotion   1470 non-null   int64 
 34  YearsWithCurrManager      1470 non-null   int64 
dtypes: int64(26), object(9)
memory usage: 402.1+ KB
In [20]:
df.describe().T
Out[20]:
count mean std min 25% 50% 75% max
Age 1470.0 36.923810 9.135373 18.0 30.00 36.0 43.00 60.0
DailyRate 1470.0 802.485714 403.509100 102.0 465.00 802.0 1157.00 1499.0
DistanceFromHome 1470.0 9.192517 8.106864 1.0 2.00 7.0 14.00 29.0
Education 1470.0 2.912925 1.024165 1.0 2.00 3.0 4.00 5.0
EmployeeCount 1470.0 1.000000 0.000000 1.0 1.00 1.0 1.00 1.0
EmployeeNumber 1470.0 1024.865306 602.024335 1.0 491.25 1020.5 1555.75 2068.0
EnvironmentSatisfaction 1470.0 2.721769 1.093082 1.0 2.00 3.0 4.00 4.0
HourlyRate 1470.0 65.891156 20.329428 30.0 48.00 66.0 83.75 100.0
JobInvolvement 1470.0 2.729932 0.711561 1.0 2.00 3.0 3.00 4.0
JobLevel 1470.0 2.063946 1.106940 1.0 1.00 2.0 3.00 5.0
JobSatisfaction 1470.0 2.728571 1.102846 1.0 2.00 3.0 4.00 4.0
MonthlyIncome 1470.0 6502.931293 4707.956783 1009.0 2911.00 4919.0 8379.00 19999.0
MonthlyRate 1470.0 14313.103401 7117.786044 2094.0 8047.00 14235.5 20461.50 26999.0
NumCompaniesWorked 1470.0 2.693197 2.498009 0.0 1.00 2.0 4.00 9.0
PercentSalaryHike 1470.0 15.209524 3.659938 11.0 12.00 14.0 18.00 25.0
PerformanceRating 1470.0 3.153741 0.360824 3.0 3.00 3.0 3.00 4.0
RelationshipSatisfaction 1470.0 2.712245 1.081209 1.0 2.00 3.0 4.00 4.0
StandardHours 1470.0 80.000000 0.000000 80.0 80.00 80.0 80.00 80.0
StockOptionLevel 1470.0 0.793878 0.852077 0.0 0.00 1.0 1.00 3.0
TotalWorkingYears 1470.0 11.279592 7.780782 0.0 6.00 10.0 15.00 40.0
TrainingTimesLastYear 1470.0 2.799320 1.289271 0.0 2.00 3.0 3.00 6.0
WorkLifeBalance 1470.0 2.761224 0.706476 1.0 2.00 3.0 3.00 4.0
YearsAtCompany 1470.0 7.008163 6.126525 0.0 3.00 5.0 9.00 40.0
YearsInCurrentRole 1470.0 4.229252 3.623137 0.0 2.00 3.0 7.00 18.0
YearsSinceLastPromotion 1470.0 2.187755 3.222430 0.0 0.00 1.0 3.00 15.0
YearsWithCurrManager 1470.0 4.123129 3.568136 0.0 2.00 3.0 7.00 17.0

We will drop the ['EmployeeCount', 'Over18', 'StandardHours'] columns because they are constant (zero variance), and 'EmployeeNumber' because it is a unique identifier with no predictive value
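
Constant columns can also be flagged programmatically. A hedged sketch on a toy frame: `nunique() == 1` catches constants of any dtype, whereas `std() == 0` only works for numeric columns and would miss the object-dtype 'Over18':

```python
import pandas as pd

# Toy frame mimicking the constant columns in the IBM data.
toy = pd.DataFrame({
    'EmployeeCount': [1, 1, 1],
    'Over18':        ['Y', 'Y', 'Y'],
    'Age':           [41, 49, 37],
})
# A column with a single unique value carries no information.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
print(constant_cols)  # -> ['EmployeeCount', 'Over18']
```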

In [10]:
clean_df = df.drop(columns=['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber'])
clean_df.head()
Out[10]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 2 Female 94 3 2 Sales Executive 4 Single 5993 19479 8 Yes 11 3 1 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 3 Male 61 2 2 Research Scientist 2 Married 5130 24907 1 No 23 4 4 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 4 Male 92 2 1 Laboratory Technician 3 Single 2090 2396 6 Yes 15 3 2 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 4 Female 56 3 1 Research Scientist 3 Married 2909 23159 1 Yes 11 3 3 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 Male 40 3 1 Laboratory Technician 2 Married 3468 16632 9 No 12 3 4 1 6 3 3 2 2 2 2
In [14]:
clean_df.Attrition.replace(to_replace = dict(Yes = 1, No = 0), inplace = True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-889a4fce0da5> in <module>
----> 1 clean_df.Attrition.replace(to_replace = dict(Yes = 1, No = 0), inplace = True)

[... internal pandas frames omitted ...]

TypeError: Cannot compare types 'ndarray(dtype=int64)' and 'str'
In [16]:
clean_df["Attrition"]
Out[16]:
0       1
1       0
2       1
3       0
4       0
       ..
1465    0
1466    0
1467    0
1468    0
1469    0
Name: Attrition, Length: 1470, dtype: int64
In [17]:
header = ['Age', 'Attrition', 'Business_Travel', 'Daily_Rate',
         'Department', 'Distance_From_Home', 'Education', 'Education_Field',
         'Environment_Satisfaction', 'Gender', 
         'Hourly_Rate', 'Job_Involvement', 'Job_Level', 'Job_Role', 
         'Job_Satisfaction', 'Marital_Status', 'Monthly_Income',
         'Monthly_Rate', 'Num_Companies_Worked', 'Overtime', 
         'Percent_Salary_Hike', 'Performance_Rating', 'Relationship_Satisfaction', 
          'Stock_Option_Level','Total_Working_Years', 
          'Training_Times_Last_Year', 'Work_Life_Balance', 'Years_At_Company',
         'Years_In_Current_Role', 'Years_Since_Last_Promotion', 
         'Years_With_Current_Manager']
In [18]:
clean_df.columns = header
In [19]:
clean_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 31 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Age                         1470 non-null   int64 
 1   Attrition                   1470 non-null   int64 
 2   Business_Travel             1470 non-null   object
 3   Daily_Rate                  1470 non-null   int64 
 4   Department                  1470 non-null   object
 5   Distance_From_Home          1470 non-null   int64 
 6   Education                   1470 non-null   int64 
 7   Education_Field             1470 non-null   object
 8   Environment_Satisfaction    1470 non-null   int64 
 9   Gender                      1470 non-null   object
 10  Hourly_Rate                 1470 non-null   int64 
 11  Job_Involvement             1470 non-null   int64 
 12  Job_Level                   1470 non-null   int64 
 13  Job_Role                    1470 non-null   object
 14  Job_Satisfaction            1470 non-null   int64 
 15  Marital_Status              1470 non-null   object
 16  Monthly_Income              1470 non-null   int64 
 17  Monthly_Rate                1470 non-null   int64 
 18  Num_Companies_Worked        1470 non-null   int64 
 19  Overtime                    1470 non-null   object
 20  Percent_Salary_Hike         1470 non-null   int64 
 21  Performance_Rating          1470 non-null   int64 
 22  Relationship_Satisfaction   1470 non-null   int64 
 23  Stock_Option_Level          1470 non-null   int64 
 24  Total_Working_Years         1470 non-null   int64 
 25  Training_Times_Last_Year    1470 non-null   int64 
 26  Work_Life_Balance           1470 non-null   int64 
 27  Years_At_Company            1470 non-null   int64 
 28  Years_In_Current_Role       1470 non-null   int64 
 29  Years_Since_Last_Promotion  1470 non-null   int64 
 30  Years_With_Current_Manager  1470 non-null   int64 
dtypes: int64(24), object(7)
memory usage: 356.1+ KB
In [21]:
clean_df.describe()
Out[21]:
Age Attrition Daily_Rate Distance_From_Home Education Environment_Satisfaction Hourly_Rate Job_Involvement Job_Level Job_Satisfaction Monthly_Income Monthly_Rate Num_Companies_Worked Percent_Salary_Hike Performance_Rating Relationship_Satisfaction Stock_Option_Level Total_Working_Years Training_Times_Last_Year Work_Life_Balance Years_At_Company Years_In_Current_Role Years_Since_Last_Promotion Years_With_Current_Manager
count 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000 1470.000000
mean 36.923810 0.161224 802.485714 9.192517 2.912925 2.721769 65.891156 2.729932 2.063946 2.728571 6502.931293 14313.103401 2.693197 15.209524 3.153741 2.712245 0.793878 11.279592 2.799320 2.761224 7.008163 4.229252 2.187755 4.123129
std 9.135373 0.367863 403.509100 8.106864 1.024165 1.093082 20.329428 0.711561 1.106940 1.102846 4707.956783 7117.786044 2.498009 3.659938 0.360824 1.081209 0.852077 7.780782 1.289271 0.706476 6.126525 3.623137 3.222430 3.568136
min 18.000000 0.000000 102.000000 1.000000 1.000000 1.000000 30.000000 1.000000 1.000000 1.000000 1009.000000 2094.000000 0.000000 11.000000 3.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000
25% 30.000000 0.000000 465.000000 2.000000 2.000000 2.000000 48.000000 2.000000 1.000000 2.000000 2911.000000 8047.000000 1.000000 12.000000 3.000000 2.000000 0.000000 6.000000 2.000000 2.000000 3.000000 2.000000 0.000000 2.000000
50% 36.000000 0.000000 802.000000 7.000000 3.000000 3.000000 66.000000 3.000000 2.000000 3.000000 4919.000000 14235.500000 2.000000 14.000000 3.000000 3.000000 1.000000 10.000000 3.000000 3.000000 5.000000 3.000000 1.000000 3.000000
75% 43.000000 0.000000 1157.000000 14.000000 4.000000 4.000000 83.750000 3.000000 3.000000 4.000000 8379.000000 20461.500000 4.000000 18.000000 3.000000 4.000000 1.000000 15.000000 3.000000 3.000000 9.000000 7.000000 3.000000 7.000000
max 60.000000 1.000000 1499.000000 29.000000 5.000000 4.000000 100.000000 4.000000 5.000000 4.000000 19999.000000 26999.000000 9.000000 25.000000 4.000000 4.000000 3.000000 40.000000 6.000000 4.000000 40.000000 18.000000 15.000000 17.000000

Target distribution (number and %)

In [22]:
attrition = clean_df[(clean_df['Attrition'] != 0)]
no_attrition = clean_df[(clean_df['Attrition'] == 0)]

#Total Count of Yes/No 'Attrition'
trace = go.Bar(x = (len(attrition), len(no_attrition)), y = ['Yes_attrition', 'No_attrition'], 
               orientation = 'h', opacity = 0.8, marker=dict(color=['gold', 'lightskyblue'],
                                                             line=dict(color='#000000',width=1.5)))

layout = dict(title =  'Count of attrition variable')
                    
fig = dict(data = [trace], layout=layout)
py.iplot(fig)

#Percentage of Yes/No 'Attrition'
trace = go.Pie(labels = ['No_attrition', 'Yes_attrition'], values = clean_df['Attrition'].value_counts(), 
               textfont=dict(size=15), opacity = 0.8,
               marker=dict(colors=['lightskyblue','gold'], 
                           line=dict(color='#000000', width=1.5)))


layout = dict(title =  'Distribution of attrition variable')
           
fig = dict(data = [trace], layout=layout)
py.iplot(fig)

Features distribution & barplot (hue = Attrition)

In [23]:
def plot_distribution(var_select, bin_size): 
    # Correlation between the selected variable and the binary target
    corr = clean_df['Attrition'].corr(clean_df[var_select])
    corr = np.round(corr,3)
    tmp1 = attrition[var_select]
    tmp2 = no_attrition[var_select]
    hist_data = [tmp1, tmp2]
    
    group_labels = ['Yes_attrition', 'No_attrition']
    colors = ['#FFD700', '#7EC0EE']

    fig = ff.create_distplot(hist_data, group_labels, colors = colors, show_hist = True, 
                             curve_type='kde', bin_size = bin_size)
    
    fig['layout'].update(title = var_select+' '+'(corr target ='+ str(corr)+')')

    py.iplot(fig, filename = 'Density plot')
In [24]:
def barplot(var_select, x_no_numeric) :
    tmp1 = clean_df[(clean_df['Attrition'] != 0)]
    tmp2 = clean_df[(clean_df['Attrition'] == 0)]
    tmp3 = pd.DataFrame(pd.crosstab(clean_df[var_select],clean_df['Attrition']), )
    tmp3['Attr%'] = tmp3[1] / (tmp3[1] + tmp3[0]) * 100
    if x_no_numeric:
        tmp3 = tmp3.sort_values(1, ascending = False)

    color=['lightskyblue','gold' ]
    trace1 = go.Bar(
        x=tmp1[var_select].value_counts().keys().tolist(),
        y=tmp1[var_select].value_counts().values.tolist(),
        name='Yes_Attrition',opacity = 0.8, marker=dict(
        color='gold',
        line=dict(color='#000000',width=1)))

    
    trace2 = go.Bar(
        x=tmp2[var_select].value_counts().keys().tolist(),
        y=tmp2[var_select].value_counts().values.tolist(),
        name='No_Attrition', opacity = 0.8, marker=dict(
        color='lightskyblue',
        line=dict(color='#000000',width=1)))
    
    trace3 =  go.Scatter(   
        x=tmp3.index,
        y=tmp3['Attr%'],
        yaxis = 'y2',
        name='% Attrition', opacity = 0.6, marker=dict(
        color='black',
        line=dict(color='#000000',width=0.5
        )))

    layout = dict(title =  str(var_select),
              xaxis=dict(), 
              yaxis=dict(title= 'Count'), 
              yaxis2=dict(range= [-0, 75], 
                          overlaying= 'y', 
                          anchor= 'x', 
                          side= 'right',
                          zeroline=False,
                          showgrid= False, 
                          title= '% Attrition'
                         ))

    fig = go.Figure(data=[trace1, trace2, trace3], layout=layout)
    py.iplot(fig)
In [25]:
plot_distribution('Age', False)
barplot('Age', False)
plot_distribution('Daily_Rate', 100)
plot_distribution('Distance_From_Home', False)
barplot('Distance_From_Home', False)
plot_distribution('Hourly_Rate', False)
plot_distribution('Monthly_Income', 100)
plot_distribution('Monthly_Rate', 100)
plot_distribution('Num_Companies_Worked', False)
barplot('Num_Companies_Worked',False)
plot_distribution('Percent_Salary_Hike', False)
barplot('Percent_Salary_Hike', False) 
plot_distribution('Total_Working_Years', False)
barplot('Total_Working_Years', False)
plot_distribution('Training_Times_Last_Year', False)
barplot('Training_Times_Last_Year',False)
plot_distribution('Years_At_Company', False)
barplot('Years_At_Company', False)
plot_distribution('Years_In_Current_Role', False)
barplot('Years_In_Current_Role', False)
plot_distribution('Years_Since_Last_Promotion', False)
barplot('Years_Since_Last_Promotion', False)
plot_distribution('Years_With_Current_Manager', False)
barplot('Years_With_Current_Manager', False)

Pie plot and barplot

In [26]:
def plot_pie(var_select) :
    
    colors = ['gold', 'lightgreen', 'lightcoral', 'lightskyblue', 'lightgrey', 'orange', 'white', 'lightpink']
    trace1 = go.Pie(values  = attrition[var_select].value_counts().values.tolist(),
                    labels  = attrition[var_select].value_counts().keys().tolist(),
                    textfont=dict(size=15), opacity = 0.8,
                    hoverinfo = "label+percent+name",
                    domain  = dict(x = [0,.48]),
                    name    = "attrition employees",
                    marker  = dict(colors = colors, line = dict(width = 1.5)))
    trace2 = go.Pie(values  = no_attrition[var_select].value_counts().values.tolist(),
                    labels  = no_attrition[var_select].value_counts().keys().tolist(),
                    textfont=dict(size=15), opacity = 0.8,
                    hoverinfo = "label+percent+name",
                    marker  = dict(colors = colors, line = dict(width = 1.5)),
                    domain  = dict(x = [.52,1]),
                    name    = "Non-attrition employees" )

    layout = go.Layout(dict(title = var_select + " distribution by attrition status",
                            annotations = [dict(text = "Yes_attrition",
                                                font = dict(size = 13),
                                                showarrow = False,
                                                x = .22, y = -0.1),
                                            dict(text = "No_attrition",
                                                font = dict(size = 13),
                                                showarrow = False,
                                                x = .8,y = -.1)]))
                                          

    fig  = go.Figure(data = [trace1,trace2],layout = layout)
    py.iplot(fig)
In [27]:
plot_pie("Gender")
barplot('Gender',True)
plot_pie('Overtime')
barplot('Overtime',True)
plot_pie('Business_Travel')
barplot('Business_Travel',True)
plot_pie('Job_Role')
barplot('Job_Role',True)
plot_pie('Department') 
barplot('Department',True)
plot_pie('Marital_Status') 
barplot('Marital_Status',True)
plot_pie('Education_Field') 
barplot('Education_Field',True)
plot_pie('Education') 
barplot('Education',False)
plot_pie('Environment_Satisfaction')
barplot('Environment_Satisfaction',False)
plot_pie('Job_Involvement')
barplot('Job_Involvement', False)
plot_pie('Job_Level')
barplot('Job_Level',False)
plot_pie('Job_Satisfaction')
barplot('Job_Satisfaction',False)
plot_pie('Performance_Rating')
barplot('Performance_Rating',False)
plot_pie('Relationship_Satisfaction')
barplot('Relationship_Satisfaction', False)
plot_pie('Stock_Option_Level')
barplot('Stock_Option_Level', False)
plot_pie('Work_Life_Balance')
barplot('Work_Life_Balance', False)
In [28]:
Education_sum = clean_df.groupby(['Education']).sum()
In [29]:
Education_sum
Out[29]:
Age Attrition Daily_Rate Distance_From_Home Environment_Satisfaction Hourly_Rate Job_Involvement Job_Level Job_Satisfaction Monthly_Income Monthly_Rate Num_Companies_Worked Percent_Salary_Hike Performance_Rating Relationship_Satisfaction Stock_Option_Level Total_Working_Years Training_Times_Last_Year Work_Life_Balance Years_At_Company Years_In_Current_Role Years_Since_Last_Promotion Years_With_Current_Manager
Education
1 5468 31 139811 1486 467 11056 448 305 476 958897 2585377 322 2632 540 463 125 1494 487 472 1103 666 325 616
2 10395 44 231831 2596 766 18675 767 576 781 1755914 4018485 729 4283 894 781 233 3021 803 780 1865 1104 563 1127
3 20866 99 447190 5287 1588 37548 1575 1183 1517 3727875 8055367 1543 8655 1798 1524 443 6441 1599 1561 3924 2456 1262 2341
4 15634 58 323112 3640 1053 26296 1087 849 1109 2719296 5684232 1227 6037 1251 1093 327 4949 1082 1111 3009 1759 936 1752
5 1915 5 37710 504 127 3285 136 121 128 397327 696801 138 751 153 126 39 676 144 135 401 232 130 225
In [30]:
Education = clean_df.groupby(['Attrition', 'Education']).mean()
In [31]:
Education
Out[31]:
Age Daily_Rate Distance_From_Home Environment_Satisfaction Hourly_Rate Job_Involvement Job_Level Job_Satisfaction Monthly_Income Monthly_Rate Num_Companies_Worked Percent_Salary_Hike Performance_Rating Relationship_Satisfaction Stock_Option_Level Total_Working_Years Training_Times_Last_Year Work_Life_Balance Years_At_Company Years_In_Current_Role Years_Since_Last_Promotion Years_With_Current_Manager
Attrition Education
0 1 32.956835 820.532374 8.496403 2.726619 64.899281 2.690647 1.863309 2.820144 5926.129496 15257.064748 1.884892 15.604317 3.187050 2.776978 0.820144 9.338129 2.827338 2.755396 6.856115 4.165468 1.956835 3.841727
2 37.126050 834.899160 8.890756 2.785714 65.819328 2.794118 2.138655 2.823529 6586.058824 14240.638655 2.432773 15.247899 3.172269 2.827731 0.886555 11.172269 2.915966 2.785714 7.063025 4.147059 2.126050 4.268908
3 37.348837 793.845666 9.025370 2.864693 66.213531 2.780127 2.156448 2.701903 6882.919662 14004.344609 2.619450 15.097252 3.135307 2.682875 0.813953 12.067653 2.839323 2.737844 7.412262 4.678647 2.321353 4.450317
4 39.691176 823.923529 8.770588 2.670588 66.067647 2.764706 2.197059 2.841176 7087.814706 14208.376471 3.102941 15.185294 3.144118 2.726471 0.873529 12.805882 2.747059 2.850000 7.626471 4.520588 2.226471 4.482353
5 40.348837 777.558140 10.348837 2.604651 66.302326 2.837209 2.581395 2.744186 8559.906977 14530.209302 2.953488 15.767442 3.209302 2.697674 0.813953 14.139535 3.000000 2.767442 8.209302 4.953488 2.837209 4.790698
1 1 28.612903 830.870968 9.838710 2.838710 65.645161 2.387097 1.483871 2.709677 4360.161290 14988.548387 1.935484 14.935484 3.129032 2.483871 0.354839 6.322581 3.032258 2.870968 4.838710 2.806452 1.709677 2.645161
2 35.431818 752.840909 10.909091 2.340909 68.409091 2.318182 1.522727 2.477273 4282.545455 14300.295455 3.409091 14.863636 3.159091 2.454545 0.500000 8.227273 2.477273 2.659091 4.181818 2.659091 1.295455 2.522727
3 32.323232 724.252525 10.282828 2.353535 62.919192 2.626263 1.646465 2.414141 4770.242424 14457.696970 3.070707 15.292929 3.181818 2.575758 0.585859 7.404040 2.585859 2.686869 4.222222 2.454545 1.656566 2.383838
4 36.879310 741.000000 11.344828 2.500000 66.086207 2.534483 1.758621 2.465517 5335.155172 14713.517241 2.965517 15.068966 3.137931 2.862069 0.517241 10.258621 2.551724 2.448276 7.172414 3.827586 3.086207 3.931034
5 36.000000 855.000000 11.800000 3.000000 86.800000 2.800000 2.000000 2.000000 5850.200000 14400.400000 2.200000 14.600000 3.000000 2.000000 0.800000 13.600000 3.000000 3.200000 9.600000 3.800000 1.600000 3.800000
In [32]:
clean_df.groupby(['Education', 'Attrition']).count()
Out[32]:
Age Business_Travel Daily_Rate Department Distance_From_Home Education_Field Environment_Satisfaction Gender Hourly_Rate Job_Involvement Job_Level Job_Role Job_Satisfaction Marital_Status Monthly_Income Monthly_Rate Num_Companies_Worked Overtime Percent_Salary_Hike Performance_Rating Relationship_Satisfaction Stock_Option_Level Total_Working_Years Training_Times_Last_Year Work_Life_Balance Years_At_Company Years_In_Current_Role Years_Since_Last_Promotion Years_With_Current_Manager
Education Attrition
1 0 139 139 139 139 139 139 139 139 139 139 139 139 139 139 139 139 139 139 139 139 139 139 139 139 139 139 139 139 139
1 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31
2 0 238 238 238 238 238 238 238 238 238 238 238 238 238 238 238 238 238 238 238 238 238 238 238 238 238 238 238 238 238
1 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44
3 0 473 473 473 473 473 473 473 473 473 473 473 473 473 473 473 473 473 473 473 473 473 473 473 473 473 473 473 473 473
1 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99
4 0 340 340 340 340 340 340 340 340 340 340 340 340 340 340 340 340 340 340 340 340 340 340 340 340 340 340 340 340 340
1 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
5 0 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43
1 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
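The counts above translate directly into per-level attrition rates. A minimal sketch that hardcodes the counts from the table (the column names `stayed`/`left` are illustrative, not from the dataset):

```python
import pandas as pd

# Counts taken from the groupby output above (Attrition == 0 vs Attrition == 1)
counts = pd.DataFrame({
    "stayed": [139, 238, 473, 340, 43],
    "left":   [31, 44, 99, 58, 5],
}, index=pd.Index([1, 2, 3, 4, 5], name="Education"))

# Attrition rate = leavers / total employees at each education level
counts["attrition_rate"] = counts["left"] / (counts["stayed"] + counts["left"])
print(counts["attrition_rate"].round(4))
```

The rates range from roughly 10% (level 5) to roughly 18% (level 1), which previews the formal test below: the differences across education levels are modest.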

Hypothesis Testing

Test 1

  • Ho: µ Education_Attrition == µ Education_Not_Attrition
  • Ha: µ Education_Attrition != µ Education_Not_Attrition
In [33]:
attrition_df = clean_df[clean_df['Attrition'] == 1]
not_attrition_df = clean_df[clean_df['Attrition'] == 0]
In [34]:
attrition_df
Out[34]:
Age Attrition Business_Travel Daily_Rate Department Distance_From_Home Education Education_Field Environment_Satisfaction Gender Hourly_Rate Job_Involvement Job_Level Job_Role Job_Satisfaction Marital_Status Monthly_Income Monthly_Rate Num_Companies_Worked Overtime Percent_Salary_Hike Performance_Rating Relationship_Satisfaction Stock_Option_Level Total_Working_Years Training_Times_Last_Year Work_Life_Balance Years_At_Company Years_In_Current_Role Years_Since_Last_Promotion Years_With_Current_Manager
0 41 1 Travel_Rarely 1102 Sales 1 2 Life Sciences 2 Female 94 3 2 Sales Executive 4 Single 5993 19479 8 Yes 11 3 1 0 8 0 1 6 4 0 5
2 37 1 Travel_Rarely 1373 Research & Development 2 2 Other 4 Male 92 2 1 Laboratory Technician 3 Single 2090 2396 6 Yes 15 3 2 0 7 3 3 0 0 0 0
14 28 1 Travel_Rarely 103 Research & Development 24 3 Life Sciences 3 Male 50 2 1 Laboratory Technician 3 Single 2028 12947 5 Yes 14 3 2 0 6 4 3 4 2 0 3
21 36 1 Travel_Rarely 1218 Sales 9 4 Life Sciences 3 Male 82 2 1 Sales Representative 1 Single 3407 6986 7 No 23 4 2 0 10 4 3 5 3 0 3
24 34 1 Travel_Rarely 699 Research & Development 6 1 Medical 2 Male 83 3 1 Research Scientist 1 Single 2960 17102 2 No 11 3 3 0 8 2 3 4 2 1 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1438 23 1 Travel_Frequently 638 Sales 9 3 Marketing 4 Male 33 3 1 Sales Representative 1 Married 1790 26956 1 No 19 3 1 1 1 3 2 1 0 1 0
1442 29 1 Travel_Rarely 1092 Research & Development 1 4 Medical 1 Male 36 3 1 Research Scientist 4 Married 4787 26124 9 Yes 14 3 2 3 4 3 4 2 2 2 2
1444 56 1 Travel_Rarely 310 Research & Development 7 2 Technical Degree 4 Male 72 3 1 Laboratory Technician 3 Married 2339 3666 8 No 11 3 4 1 14 4 1 10 9 9 8
1452 50 1 Travel_Frequently 878 Sales 1 4 Life Sciences 2 Male 94 3 2 Sales Executive 3 Divorced 6728 14255 7 No 12 3 4 2 12 3 3 6 3 0 1
1461 50 1 Travel_Rarely 410 Sales 28 3 Marketing 4 Male 39 2 3 Sales Executive 1 Divorced 10854 16586 4 Yes 13 3 2 1 20 3 3 3 2 2 0

237 rows × 31 columns

In [35]:
not_attrition_df
Out[35]:
Age Attrition Business_Travel Daily_Rate Department Distance_From_Home Education Education_Field Environment_Satisfaction Gender Hourly_Rate Job_Involvement Job_Level Job_Role Job_Satisfaction Marital_Status Monthly_Income Monthly_Rate Num_Companies_Worked Overtime Percent_Salary_Hike Performance_Rating Relationship_Satisfaction Stock_Option_Level Total_Working_Years Training_Times_Last_Year Work_Life_Balance Years_At_Company Years_In_Current_Role Years_Since_Last_Promotion Years_With_Current_Manager
1 49 0 Travel_Frequently 279 Research & Development 8 1 Life Sciences 3 Male 61 2 2 Research Scientist 2 Married 5130 24907 1 No 23 4 4 1 10 3 3 10 7 1 7
3 33 0 Travel_Frequently 1392 Research & Development 3 4 Life Sciences 4 Female 56 3 1 Research Scientist 3 Married 2909 23159 1 Yes 11 3 3 0 8 3 3 8 7 3 0
4 27 0 Travel_Rarely 591 Research & Development 2 1 Medical 1 Male 40 3 1 Laboratory Technician 2 Married 3468 16632 9 No 12 3 4 1 6 3 3 2 2 2 2
5 32 0 Travel_Frequently 1005 Research & Development 2 2 Life Sciences 4 Male 79 3 1 Laboratory Technician 4 Single 3068 11864 0 No 13 3 3 0 8 2 2 7 7 3 6
6 59 0 Travel_Rarely 1324 Research & Development 3 3 Medical 3 Female 81 4 1 Laboratory Technician 1 Married 2670 9964 4 Yes 20 4 1 3 12 3 2 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1465 36 0 Travel_Frequently 884 Research & Development 23 2 Medical 3 Male 41 4 2 Laboratory Technician 4 Married 2571 12290 4 No 17 3 3 1 17 3 3 5 2 0 3
1466 39 0 Travel_Rarely 613 Research & Development 6 1 Medical 4 Male 42 2 3 Healthcare Representative 1 Married 9991 21457 4 No 15 3 1 1 9 5 3 7 7 1 7
1467 27 0 Travel_Rarely 155 Research & Development 4 3 Life Sciences 2 Male 87 4 2 Manufacturing Director 2 Married 6142 5174 1 Yes 20 4 2 1 6 0 3 6 2 0 3
1468 49 0 Travel_Frequently 1023 Sales 2 3 Medical 4 Male 63 2 2 Sales Executive 2 Married 5390 13243 2 No 14 3 4 0 17 3 2 9 6 0 8
1469 34 0 Travel_Rarely 628 Research & Development 8 3 Medical 2 Male 82 4 2 Laboratory Technician 3 Married 4404 10228 2 No 12 3 1 0 6 3 4 4 3 1 2

1233 rows × 31 columns

In [36]:
stats.shapiro(attrition_df['Education'])
Out[36]:
ShapiroResult(statistic=0.8860442042350769, pvalue=2.277022537536255e-12)
In [37]:
stats.shapiro(not_attrition_df['Education'])
Out[37]:
ShapiroResult(statistic=0.8966962099075317, pvalue=5.311802897914952e-28)
In [38]:
stats.ttest_ind(attrition_df['Education'], 
                not_attrition_df['Education'], equal_var = False)
Out[38]:
Ttest_indResult(statistic=-1.2177493963259696, pvalue=0.22417128841924902)

We also run the Kruskal-Wallis test, a non-parametric alternative, since the Shapiro-Wilk results above indicate that neither group is normally distributed.

In [39]:
stats.kruskal(attrition_df['Education'], not_attrition_df['Education'])
Out[39]:
KruskalResult(statistic=1.3527640913093548, pvalue=0.2447954753326153)

pvalue > 0.05

There appears to be no statistically significant relationship between Attrition and Education, thus we fail to reject the null hypothesis.
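The three-step pattern used in each test (Shapiro-Wilk normality check, then Welch's t-test and the Kruskal-Wallis test as a robustness check) can be wrapped in a small helper. A hedged sketch on synthetic data, not the actual dataset (`compare_groups` and the sample parameters are assumptions for illustration):

```python
import numpy as np
from scipy import stats

def compare_groups(a, b, alpha=0.05):
    """Check each group for normality, then compare with both a
    parametric (Welch) and a non-parametric (Kruskal-Wallis) test."""
    normal = bool(stats.shapiro(a).pvalue > alpha and
                  stats.shapiro(b).pvalue > alpha)
    welch_p = stats.ttest_ind(a, b, equal_var=False).pvalue
    kruskal_p = stats.kruskal(a, b).pvalue
    return {"normal": normal, "welch_p": welch_p, "kruskal_p": kruskal_p}

rng = np.random.default_rng(0)
# Hypothetical example: two groups whose means clearly differ
res = compare_groups(rng.normal(30, 5, 200), rng.normal(37, 5, 1000))
print(res)
```

Using `equal_var=False` selects Welch's t-test, which does not assume the two groups have equal variances or equal sample sizes; that matters here because the attrition group (237 rows) is much smaller than the non-attrition group (1233 rows).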

Hypothesis Testing

Test 2

  • Ho: µ Age_Attrition == µ Age_Not_Attrition
  • Ha: µ Age_Attrition != µ Age_Not_Attrition

In [40]:
stats.shapiro(attrition_df['Age'])
Out[40]:
ShapiroResult(statistic=0.9457082748413086, pvalue=1.0054400689796239e-07)
In [41]:
stats.shapiro(not_attrition_df['Age'])
Out[41]:
ShapiroResult(statistic=0.9782127737998962, pvalue=1.1040232313422327e-12)
In [42]:
stats.ttest_ind(attrition_df['Age'], 
                not_attrition_df['Age'], equal_var = False)
Out[42]:
Ttest_indResult(statistic=-5.828011853988949, pvalue=1.3797600649439775e-08)
In [43]:
stats.kruskal(attrition_df['Age'], not_attrition_df['Age'])
Out[43]:
KruskalResult(statistic=43.06268844023747, pvalue=5.3013684961038114e-11)

pvalue < 0.05

There appears to be a statistically significant relationship between Attrition and Age, thus we can reject the null hypothesis.

Hypothesis Testing

Test 3

  • Ho: µ Job_Satisfaction_Attrition == µ Job_Satisfaction_Not_Attrition
  • Ha: µ Job_Satisfaction_Attrition != µ Job_Satisfaction_Not_Attrition

In [44]:
stats.shapiro(attrition_df['Job_Satisfaction'])
Out[44]:
ShapiroResult(statistic=0.8509283065795898, pvalue=2.3115119166623385e-14)
In [45]:
stats.shapiro(not_attrition_df['Job_Satisfaction'])
Out[45]:
ShapiroResult(statistic=0.8425453901290894, pvalue=2.8797148567139087e-33)
In [46]:
stats.ttest_ind(attrition_df['Job_Satisfaction'], 
                not_attrition_df['Job_Satisfaction'], equal_var = False)
Out[46]:
Ttest_indResult(statistic=-3.9261129248238826, pvalue=0.0001052049107397441)
In [47]:
stats.kruskal(attrition_df['Job_Satisfaction'], not_attrition_df['Job_Satisfaction'])
Out[47]:
KruskalResult(statistic=15.568947932935844, pvalue=7.955037680315368e-05)

pvalue < 0.05

There appears to be a statistically significant relationship between Attrition and Job Satisfaction, thus we can reject the null hypothesis.
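Since the same Kruskal-Wallis comparison is repeated for Education, Age, and Job_Satisfaction, the step could also be looped over columns. A sketch using a synthetic stand-in for `clean_df` (all names and values here are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic stand-in: Age differs by Attrition, Education does not
df = pd.DataFrame({"Attrition": rng.integers(0, 2, 600)})
df["Age"] = rng.normal(37, 9, 600).round() - 4 * df["Attrition"]
df["Education"] = rng.integers(1, 6, 600)

left = df[df["Attrition"] == 1]
stayed = df[df["Attrition"] == 0]

results = {}
for col in ["Education", "Age"]:
    p = stats.kruskal(left[col], stayed[col]).pvalue
    results[col] = p
    verdict = "reject H0" if p < 0.05 else "fail to reject H0"
    print(f"{col}: p={p:.3g} -> {verdict}")
```

This mirrors the conclusions above: a real difference in Age between leavers and stayers produces a small p-value, while a feature unrelated to attrition does not.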

Exporting Clean DF

In [48]:
clean_df.to_csv('clean_IBM_HR.csv', header=True)